In this Section we discuss the first of two fundamental paradigms for effective cross-validation. With this first approach - called boosting - we take a 'bottom-up' approach to fine-tuning the proper amount of capacity a model needs, building a model of the form above sequentially one unit at-a-time. In other words, we begin with a very low capacity model (just a bias) and gradually increase its capacity by adding additional units (from the same family of universal approximators) one unit at-a-time until we achieve a model with minimal validation error. While in principle any universal approximator can be used with boosting, it is often the cross-validation method of choice when employing tree-based universal approximators (and in particular stumps).
## This code cell will not be shown in the HTML version of this notebook
# imports from custom library
import sys
sys.path.append('../../')
from mlrefined_libraries import math_optimization_library as optlib
from mlrefined_libraries import nonlinear_superlearn_library as nonlib
# demos for this notebook
regress_plotter = nonlib.nonlinear_regression_demos_multiple_panels
classif_plotter = nonlib.nonlinear_classification_visualizer_multiple_panels
static_plotter = optlib.static_plotter.Visualizer()
basic_runner = nonlib.basic_runner
classif_plotter_crossval = nonlib.crossval_classification_visualizer
datapath = '../../mlrefined_datasets/nonlinear_superlearn_datasets/'
# import autograd functionality to build functions properly for optimizers
import autograd.numpy as np
# import timer and other basic libs
import time
import copy
import math
from IPython.display import clear_output
# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True
%load_ext autoreload
%autoreload 2
With the general boosting procedure we perform cross-validation by constructing our model one unit at-a-time. In each round of boosting we add a single unit to our model and tune its parameters - and its parameters alone - optimally, keeping them fixed at these optimally tuned values forever more. Doing this we gradually increase the nonlinear capacity of our model, and can therefore determine an ideal model (one with low validation error) with high resolution. Moreover, by adding one unit at-a-time and tuning only the parameters corresponding to this new unit we keep the optimization / training cost at each step roughly constant. This tends to make boosting quite efficient in practice.
Using the dial visualization of cross-validation introduced in Section 11.2.2 we can think about the boosting procedure as starting with the dial set all the way to the left (at a model with extremely low capacity). As we progress through rounds of boosting we are turning this dial very gradually clockwise from left to right, increasing the capacity of the model very gradually in search of a model with low validation error.
As with all feature learning scenarios, for the sake of both consistency and to better address the technical issues associated with each main type of universal approximator, with boosting we virtually always choose units from the same family of universal approximators. Moreover, whether we use kernel, neural network, or tree-based units, we prefer units with low capacity for boosting. Each of the basic examples given in Section 11.1.3 introducing the three popular universal approximators fits this bill, being of low capacity compared to more advanced exemplars (e.g., deep neural networks and trees, which we detail in future Chapters): polynomials (kernels), single layer networks (neural networks), and stumps (trees). So, for example, in the case of kernels the monomials from a large degree polynomial can be chosen as units. With neural networks each unit can be chosen as a single hidden layer unit, meaning each is a parameterized nonlinear function of the same form (as detailed in Section 11.1.2 and Chapter 13). In the case of trees, a set of stumps (a very popular choice of universal approximator used with boosting due to how they are typically constructed) can be custom-built for a given dataset as described in Section 14.1.
The main reason for preferring low capacity units with boosting - regardless of the type of universal approximator employed - is so that the resolution of our model search is as high as possible. When we start adding units one at-a-time, turning the cross-validation dial clockwise from left to right as detailed above, we want the dial to turn smoothly so that we can determine the lowest validation error model as finely as possible. Using high capacity units at each round of boosting results in a coarse resolution for our model search, as we jerk the dial roughly from left to right, leaving large gaps in our model search. This kind of low resolution search could easily result in us overshooting the ideal model.
<< SMOOTH VERSUS JERKY DIAL TURNING >>
The same could be said as to why we add only one unit at-a-time, tuning its parameters alone at each round of boosting. If we added more than one unit at-a-time, or instead re-tuned every parameter of every unit at each step of this process, not only would we have significantly more computation to perform at each step, but the performance difference between subsequent models could be quite large, and we might miss out on an ideal model with low validation error. In other words, by adding one unit at-a-time our sequence of models does not overfit nearly as quickly as it would if we tuned all of the parameters of every unit simultaneously at each step of the process.
With boosting our final aim is to construct, one unit at-a-time, a model of the generic form
\begin{equation} \text{model}\left(\mathbf{x},\Theta\right) = w_0 + f_1\left(\mathbf{x}\right){w}_{1} + f_2\left(\mathbf{x}\right){w}_{2} + \cdots + f_B\left(\mathbf{x}\right)w_B \end{equation}where $f_1,\,f_2,\,...\,f_B$ are nonlinear features all of which are taken from a single family of universal approximators, and $w_0$ through $w_B$ (along with any additional weights internal to the nonlinear functions) are represented in the weight set $\Theta$. To build this model we begin with a set of $B$ units
\begin{equation} f_{1}\left(\mathbf{x}\right),\,f_{2}\left(\mathbf{x}\right),...,f_{B}\left(\mathbf{x}\right) \end{equation}of the kind described above from any family of universal approximators and - in the process of boosting - sequentially add them to our model one at-a-time over a total of $M$ rounds of boosting. At each round of boosting we determine which unit - when added to the running model - best lowers the model's training error (i.e., the error over the data on which we are training). We measure the corresponding validation error provided by this update and, in the end - after all rounds of boosting are complete - use the lowest validation error measurement found to decide which round provided the best model / number of units.
With this 'bottom-up' approach to model selection we begin with a bias-only model (the lowest capacity model imaginable) which we call $\text{model}_0$ of the form
\begin{equation} \text{model}_0^{\,}\left(\mathbf{x},\Theta\right) = w_0 \end{equation}where here our parameter set $\Theta_0^{\,} = \left\{ w_0^{\,}\right\}$. To begin we first tune the bias parameter $w_0$ by minimizing an appropriate cost (depending on whether we are solving a supervised or unsupervised learning problem) over this variable alone. For example, if we are performing regression employing the Least Squares cost we minimize
\begin{equation} \frac{1}{P}\sum_{p=1}^{P}\left(\text{model}_0^{\,}\left(\mathbf{x}_p^{\,},\Theta_0^{\,} \right) - \overset{\,}{y}_{p}^{\,}\right)^{2} = \frac{1}{P}\sum_{p=1}^{P}\left(w_0^{\,} - \overset{\,}{y}_{p}^{\,}\right)^{2} \end{equation}where the training dataset is given as $\left\{\mathbf{x}_p,\,y_p\right\}_{p=1}^P$. Minimizing this quantity gives the optimal value for our bias $w_0^{\,} \longleftarrow w_0^{\star}$. We fix the bias at this value forever more, with our original model now being
\begin{equation} \text{model}_0^{\,}\left(\mathbf{x},\Theta_0^{\,}\right) = w_0^{\star}. \end{equation}Now in our first round of boosting we want to determine which of our $B$ units can - when added to the model above with its parameters tuned appropriately - best lower $\text{model}_0$'s training error.
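For the Least Squares case this first subproblem has a simple closed form: setting the derivative of the cost to zero shows that the optimal bias is just the mean of the training outputs. Below is a quick numerical sanity check of this fact, using toy data and plain numpy rather than the notebook's autograd wrapper:

```python
import numpy as np

# The bias-only cost g(w0) = (1/P) * sum_p (w0 - y_p)^2 is minimized in
# closed form: dg/dw0 = (2/P) * sum_p (w0 - y_p) = 0 gives w0* = mean(y).
y = np.array([1.2, 0.7, 2.3, 1.8])   # toy training outputs y_p
w0_star = y.mean()                   # optimal bias, here 1.5

# sanity check against a brute-force grid search over w0
grid = np.linspace(-5.0, 5.0, 100001)
costs = ((grid[:, None] - y[None, :]) ** 2).mean(axis=1)
# the grid minimizer matches w0_star up to the grid spacing
```

This closed form is why tuning $\text{model}_0$ is essentially free in practice.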
In the case of neural network units, since each unit takes precisely the same form we can simply add any one of them to our model and tune its parameters by minimizing an appropriate cost. Denoting by $f_{s_1}$ the neural network unit we add to $\text{model}_0$, we tune its weights by minimizing an appropriate cost function over the training data, e.g., in the case of Least Squares regression
\begin{equation} \frac{1}{P}\sum_{p=1}^{P}\left(\text{model}_0^{\,}\left(\mathbf{x}_p,\Theta_0^{\,}\right) + f_{s_1}^{\,}\left(\mathbf{x}_p^{\,}\right)w_{s_1}^{\,} - {y}_{p}^{\,}\right)^{2} = \frac{1}{P}\sum_{p=1}^{P}\left(w_0^{\star} + f_{s_1}^{\,}\left(\mathbf{x}_p\right)w_{s_1}^{\,} - \overset{\,}{y}_{p}^{\,}\right)^{2}. \end{equation}Here, since the bias weight has already been set optimally, we need only tune the weight $w_{s_1}$ as well as the parameters internal to the nonlinear unit $f_{s_1}$. After minimizing the appropriate cost we then update our model as

\begin{equation} \text{model}_1^{\,}\left(\mathbf{x},\Theta_1^{\,}\right) = \text{model}_0^{\,}\left(\mathbf{x},\Theta_0^{\,}\right) + f^{\star}_{s_1}\left(\mathbf{x}\right)w_{s_1}^{\star}. \end{equation}
Notice here we have fixed the linear combination weight $w_{s_1} \longleftarrow w^{\star}_{s_1}$, and denote by $f^{\star}_{s_1}$ the final version of $f_{s_1}$ with its internal parameters set optimally. Also note that $\Theta_1^{\,}$ contains $w_0^{\star}$, $w_{s_1}^{\star}$, and any internal optimally tuned parameters of $f^{\star}_{s_1}$, all of which are fixed at their optimally determined values.
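To make this first round concrete, here is a minimal sketch with a single tanh unit on toy data, with the gradient written out by hand (the notebook's own code relies on autograd instead); the data, step length, and iteration count are all illustrative choices:

```python
import numpy as np

# Round one with a single tanh unit f(x) = tanh(v0 + v1*x).  With the bias
# already fixed at w0* = mean(y) from model_0, we tune only the linear
# combination weight w and the unit's internal weights (v0, v1) by gradient
# descent on the Least Squares cost.
x = np.linspace(-1.0, 1.0, 21)
y = np.tanh(3.0 * x) + 0.1            # toy data a single unit can fit well
w0_star = y.mean()                    # bias fixed from model_0

w, v0, v1 = 0.1, 0.1, 1.0             # initial unit parameters
for _ in range(5000):
    f = np.tanh(v0 + v1 * x)
    r = w0_star + w * f - y           # pointwise fit residual
    df = 1.0 - f ** 2                 # derivative of tanh at v0 + v1*x
    gw = np.mean(2 * r * f)           # gradient w.r.t. combination weight
    gv0 = np.mean(2 * r * w * df)     # gradients w.r.t. internal weights
    gv1 = np.mean(2 * r * w * df * x)
    w, v0, v1 = w - 0.1 * gw, v0 - 0.1 * gv0, v1 - 0.1 * gv1

final_cost = np.mean((w0_star + w * np.tanh(v0 + v1 * x) - y) ** 2)
```

Note that only three parameters are updated; the bias never moves, mirroring the fix-forever-more convention described above.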
When using kernel or tree-based units this first round of boosting looks slightly different, but ends very much the same way. Because units from these families of universal approximators do not all take a single form like neural network units, in order to determine the first unit $f_{s_1}$ to add to our model we must try out all of our features $f_1,\,f_2,\,...,f_B$ individually. We do this by minimizing an appropriate cost over each (added to $\text{model}_0$), choosing as the best unit $f_{s_1}$ the one that provides the lowest training error. In other words, with kernel and tree-based units we solve $B$ subproblems, each minimizing a cost function over a single nonlinear feature and its linear combination weight alone, e.g., in the case of Least Squares regression the $b^{th}$ of these subproblems takes the form
\begin{equation} \frac{1}{P}\sum_{p=1}^{P}\left(\text{model}_0^{\,}\left(\mathbf{x}_p,\Theta_0^{\,}\right) + f_b^{\,}\left(\mathbf{x}_p^{\,}\right)w_b^{\,} - {y}_{p}^{\,}\right)^{2} = \frac{1}{P}\sum_{p=1}^{P}\left(w_0^{\star} + f_b^{\,}\left(\mathbf{x}_p\right)w_b^{\,} - \overset{\,}{y}_{p}^{\,}\right)^{2}. \end{equation}Again here the bias weight has already been set optimally, so we need only tune the weight $w_b$ as well as any parameters internal to the nonlinear unit $f_b$ (i.e., when a tree unit is employed). The nonlinear feature that produces the smallest training error value from these $B$ subproblems corresponds to the individual feature $f_{s_1}$ that best helps explain the relationship between the input and output of our training dataset, and so we then add it to our model precisely as shown in equation (7) to form $\text{model}_1$.
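As a concrete sketch of these $B$ subproblems, the snippet below uses simple one-sided indicator stumps on toy data (not the exact stump construction of Section 14.1); with the bias fixed, each subproblem's combination weight has a closed form:

```python
import numpy as np

# First round with stump units: with the bias fixed at w0* = mean(y), each
# candidate split point defines one least squares subproblem whose weight
# has a closed form, and we keep the stump with lowest training error.
x = np.array([0.0, 0.1, 0.25, 0.4, 0.55, 0.7, 0.9])
y = np.array([1.1, 0.9, 1.0, 2.1, 1.9, 2.0, 2.2])   # outputs jump near x = 0.33
w0_star = y.mean()

# candidate split points: midpoints between consecutive sorted inputs
splits = (np.sort(x)[:-1] + np.sort(x)[1:]) / 2

best_err, best_split, best_w = np.inf, None, None
for t in splits:
    f = (x > t).astype(float)             # stump feature for split point t
    w = f @ (y - w0_star) / (f @ f)       # closed-form weight, bias fixed
    err = np.mean((w0_star + w * f - y) ** 2)
    if err < best_err:
        best_err, best_split, best_w = err, t, w
# best_split lands between 0.25 and 0.4, where the outputs jump
```

The winning split falls exactly where the data's output level changes, which is what makes stumps so convenient for boosting.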
In the second round of boosting we determine the second most important nonlinear unit and corresponding linear combination weight, and add them to our model.
If we employ a neural network unit we can pick any unit $f_{s_2}$ (since they all take precisely the same form, unlike kernel and tree units) and tune its parameters by minimizing an appropriate cost e.g., in the case of Least Squares regression
\begin{equation} \frac{1}{P}\sum_{p=1}^{P}\left(\text{model}_1^{\,}\left(\mathbf{x}_p, \Theta_1^{\,}\right) + f_{s_2}^{\,}\left(\mathbf{x}_p^{\,}\right)w_{s_2}^{\,} - {y}_{p}\right)^{2} = \frac{1}{P}\sum_{p=1}^{P}\left(w_0^{\star} + f^{\star}_{s_1}\left(\mathbf{x}_p\right)^{\,}w_{s_1}^{\star} + f_{s_2}^{\,}\left(\mathbf{x}_p^{\,}\right)w_{s_2}^{\,} - {y}_{p}\right)^{2}. \end{equation}Here we need only minimize the cost with respect to $w_{s_2}$ and the parameters internal to $f_{s_2}$, all other parameters having been previously set. Doing this we then have our second model

\begin{equation} \text{model}_2^{\,}\left(\mathbf{x},\Theta_2^{\,}\right) = \text{model}_1^{\,}\left(\mathbf{x},\Theta_1^{\,}\right) + f^{\star}_{s_2}\left(\mathbf{x}\right)w_{s_2}^{\star}. \end{equation}
Here the internal parameters of $f_{s_2}$ along with $w_{s_2}$ are now set to optimally determined values - denoted $f_{s_2} \longleftarrow f^{\star}_{s_2}$ and $w_{s_2}^{\,} \longleftarrow w_{s_2}^{\star}$ respectively. Also note that $\Theta_2^{\,}$ contains $w_0^{\star}$, $w_{s_1}^{\star}$, and $w_{s_2}^{\star}$, as well as all internal parameters of $f^{\star}_{s_1}$ and $f^{\star}_{s_2}$.
When employing kernel or tree-based units, in order to determine the second best unit to add to our model we must once again try out all $B$ features one-by-one. (If we wish to restrict ourselves to only those $B-1$ features we have not yet used we may discard $f_{s_1}$ from our list of candidates.) Sweeping through our $B$ features we try out each one individually by minimizing a cost, creating $B$ subproblems, e.g., in the case of Least Squares regression the $b^{th}$ such subproblem has us minimize
\begin{equation} \frac{1}{P}\sum_{p=1}^{P}\left(\text{model}_1^{\,}\left(\mathbf{x}_p, \Theta_1^{\,}\right) + f_b^{\,}\left(\mathbf{x}_p^{\,}\right)w_b^{\,} - {y}_{p}\right)^{2} = \frac{1}{P}\sum_{p=1}^{P}\left(w_0^{\star} + f^{\star}_{s_1}\left(\mathbf{x}_p\right)^{\,}w_{s_1}^{\star} + f_b^{\,}\left(\mathbf{x}_p^{\,}\right)w_b^{\,} - {y}_{p}\right)^{2}. \end{equation}With each of these subproblems $w_0^{\star}$, $w_{s_1}^{\star}$, as well as any parameters internal to $f^{\star}_{s_1}$ have already been set optimally - we tune only the weight $w_b$ and any parameters internal to $f_b$ in each subproblem instance. The feature $f_{s_2}$ that produces the smallest training error from these $B$ subproblems corresponds to the second most important feature in explaining the relationship between the input and output of our dataset. We then add this feature to our model, giving $\text{model}_2$, which takes precisely the form given above in equation (2).
More generally at the $M^{th}$ round of boosting we determine the $M^{th}$ most important feature (where $M\geq 2$) by following the same pattern outlined above. Regardless of the type of universal approximator employed, in the end our $M^{th}$ model takes the form
\begin{equation} \text{model}_{M}^{\,}\left(\mathbf{x},\Theta_{M}^{\,}\right) = \text{model}_{M-1}^{\,}\left(\mathbf{x},\Theta_{M-1}^{\,}\right) + f^{\star}_{s_{M}}\left(\mathbf{x}\right)w_{s_{M}}^{\star} \\ = w_0^{\star} + f^{\star}_{s_{1}}\left(\mathbf{x}\right)w_{s_{1}}^{\star} + f^{\star}_{s_{2}}\left(\mathbf{x}\right)w_{s_{2}}^{\star} + \cdots + f^{\star}_{s_{M}}\left(\mathbf{x}\right)w_{s_{M}}^{\star} \end{equation}Here $f^{\star}_{s_M}$ is the $M^{th}$ unit chosen. This can be any unit when employing neural network universal approximators, while with kernel / tree-based approximators it must be found by comparing all $B$ units (or all remaining $B-\left(M-1\right)$ units if we discard those previously chosen) in the manner detailed above, determining which provides the greatest decrease in training error over the prior model. In either case $w_{s_M}^{\star}$ and any internal parameters of the nonlinear feature $f^{\star}_{s_M}$ are set optimally, and the parameter set $\Theta_M$ then contains all internal parameters of the chosen nonlinear features $f^{\star}_{s_1},\,f^{\star}_{s_2},\,...,f^{\star}_{s_M}$ as well as the bias / linear combination weights $w_0^{\star},\,w_{s_1}^{\star},...,w_{s_M}^{\star}$.
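The full round structure described above can be sketched as follows for kernel / tree-style units, where the candidate units, data, and helper names are all illustrative:

```python
import numpy as np

def boost(candidate_units, x, y, num_rounds):
    """Sketch of the round structure for kernel / tree-style units: each
    round solves one Least Squares subproblem per candidate unit (with its
    combination weight in closed form and all earlier weights frozen), then
    permanently adds the unit giving the lowest training error."""
    model = np.full(y.shape, y.mean())       # model_0: optimally tuned bias
    history = [np.mean((model - y) ** 2)]    # training error per round
    for _ in range(num_rounds):
        best = None
        for f in candidate_units:
            feat = f(x)
            denom = feat @ feat
            if denom == 0:                   # skip degenerate (all-zero) units
                continue
            w = feat @ (y - model) / denom   # closed-form 1-D least squares
            err = np.mean((model + w * feat - y) ** 2)
            if best is None or err < best[0]:
                best = (err, f, w)
        err, f, w = best
        model = model + w * f(x)             # fix this unit forever more
        history.append(err)
    return model, history

# usage: one-sided stump units on a toy step dataset
x = np.linspace(-1.0, 1.0, 21)
y = (x > 0.25).astype(float)
units = [lambda z, t=t: (z > t).astype(float) for t in np.linspace(-0.9, 0.9, 10)]
model, history = boost(units, x, y, 5)
```

Because each round's weight is set optimally, the training error in `history` never increases; in a full implementation one would record validation error per round as well and keep the round that minimizes it.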
In building $\text{model}_M$ we have generated a sequence of $M+1$ models in total $\left\{\text{model}\left(\mathbf{x},\Theta_m^{\,}\right)\right\}_{m=0}^M$ which progressively grow in nonlinear capacity from $m = 0$ to $m = M$. Moreover, gradually increasing the nonlinear capacity in this way also gives us fine-grained control in selecting an appropriate model via validation error, as the difference in performance in terms of training / validation errors between subsequent models in this sequence can be quite small.
As mentioned previously, each type of universal approximator (kernels, neural networks, and trees) comes with its own unique set of technical peculiarities that need to be addressed for successful use in practice. Many of these universal approximator-specific technical issues are detailed in the next three Chapters, and in the specific context of boosting these technicalities still exist. Note, however, that when using kernel or neural network universal approximators an individual bias term is oftentimes included with each unit, both as it is being tested during each round of boosting and when added into the subsequent model. That is, while the generic form of our $M^{th}$ model does not change, with the addition of individual biases it is written as
\begin{equation} \text{model}_{M}^{\,}\left(\mathbf{x},\Theta_{M}^{\,}\right) = \text{model}_{M-1}^{\,}\left(\mathbf{x},\Theta_{M-1}^{\,}\right) + w_{0,s_{M}}^{\star} + f^{\star}_{s_{M}}\left(\mathbf{x}\right)w_{s_{M}}^{\star} \\ = \left(w_0^{\star} + \sum_{m=1}^M w_{0,s_{m}}^{\star}\right) + f^{\star}_{s_{1}}\left(\mathbf{x}\right)w_{s_{1}}^{\star} + f^{\star}_{s_{2}}\left(\mathbf{x}\right)w_{s_{2}}^{\star} + \cdots + f^{\star}_{s_{M}}\left(\mathbf{x}\right)w_{s_{M}}^{\star}. \end{equation}Adding an individual bias to each unit while fitting with kernel or neural network units allows for greater flexibility, and generally better results (tree-based approximators already have individual bias terms 'baked in' to them and so already have this capability, as detailed in Section 13.1). For example, with regression an individual bias allows both kernel and neural network units to be adjusted 'vertically', that is, along the output space.
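To see what an individual bias buys, note that for a fixed unit the resulting subproblem is just a two-parameter linear least squares fit over (bias, weight). Below is a sketch on toy data, where the candidate unit and the vertical offset are illustrative:

```python
import numpy as np

# With an individual bias attached to a candidate unit, the subproblem for a
# fixed feature f becomes a 2-parameter linear least squares fit, letting
# the unit shift 'vertically' along the output space.
x = np.linspace(-1.0, 1.0, 11)
y = np.sin(2.0 * x) + 0.5             # outputs offset vertically by 0.5
prev = np.zeros_like(y)               # contribution of the prior model
f = np.tanh(2.0 * x)                  # a fixed candidate unit

A = np.stack([np.ones_like(x), f], axis=1)   # columns: per-unit bias, unit
(bias, weight), *_ = np.linalg.lstsq(A, y - prev, rcond=None)
# the recovered per-unit bias absorbs the vertical offset (about 0.5 here)
```

Without the bias column the unit alone could never account for the constant offset in the data, since $\tanh$ is an odd function; the extra parameter captures it exactly.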
Here we show a simple example of boosting using stump features for regression. In particular here we use the noisy sinusoidal dataset introduced in Example 1 of the previous Section, which consists of $P=21$ low dimensional points. We construct a set of $B = 20$ stump features for this dataset as described in Section 14.1, and illustrate the result of $M = 100$ rounds of boosting (meaning that many of the stumps are used multiple times). We split the dataset into $\frac{2}{3}$ training and $\frac{1}{3}$ validation, which are highlighted in light blue and yellow respectively in the left panel of the figure below. As you move the slider from left to right the boosting steps proceed, with the resulting fit shown in the left panel with the original data (which is color-coded - blue for training and yellow for validation data) and corresponding training / validation errors shown in the right (with blue denoting training error and yellow validation error). Notice how smooth the resolution of this boosting-based model search is in terms of how closely subsequent models match in terms of their training and validation errors. This smooth search allows us to determine - with very high resolution - an ideal stump-based model for this dataset.
## This code cell will not be shown in the HTML version of this notebook
# import data
csvname = datapath + 'noisy_sin_sample.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = copy.deepcopy(data[:-1,:])
y = copy.deepcopy(data[-1:,:] )
# import booster
mylib2 = nonlib.boost_lib3.stump_booster.Setup(x,y)
# choose normalizer
mylib2.choose_normalizer(name = 'standard')
# make training / validation split
mylib2.make_train_valid_split(train_portion = 0.66)
# choose cost
mylib2.choose_cost(name = 'least_squares')
# choose optimizer
mylib2.choose_optimizer('newtons_method',max_its=1)
# run boosting
mylib2.boost(50)
# animation
frames = 51
anim = nonlib.boosting_regression_animators_v2.Visualizer(csvname)
anim.animate_trainval_boosting(mylib2,frames)
In this example we illustrate the same kind of boosting for two-class classification using a dataset of $99$ datapoints with a roughly circular decision boundary (this dataset was first used in Example 2 of the previous Section). We split the data randomly into $\frac{2}{3}$ training and $\frac{1}{3}$ validation and employ single hidden layer units for boosting. Once again we animate this process over a range of boosting steps - here we perform $30$ of them - adding one unit at-a-time. As you move the slider from left to right the results of each added unit - in terms of the nonlinear decision boundary and resulting classification - are shown in the top left (where the original data is shown), top right (where the training data alone is shown), and bottom left (where the validation data is shown) panels. A plot showing the training / validation errors at each step of the process is shown in the bottom right panel.
Once again here, moving the slider back and forth, we can see that the model providing the smallest validation error appears to (more or less) provide the best nonlinearity / decision rule for the entire dataset (i.e., the current data as well as any data that might be generated by the same process in the future).
## This code cell will not be shown in the HTML version of this notebook
# import data
csvname = datapath + 'new_circle_data.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = copy.deepcopy(data[:-1,:])
y = copy.deepcopy(data[-1:,:] )
# import booster
mylib6 = nonlib.boost_lib3.net_booster.Setup(x,y)
# choose normalizer
mylib6.choose_normalizer(name = 'standard')
# make training / validation split
mylib6.make_train_valid_split(train_portion = 0.66)
# choose cost
mylib6.choose_cost(name = 'softmax')
# choose optimizer
mylib6.choose_optimizer('gradient_descent',max_its=5000,alpha_choice=10**(0))
# run boosting
mylib6.boost(num_rounds=30,activation = 'relu')
# animate
frames = 31
anim = nonlib.boosting_classification_animator_v3.Visualizer(csvname)
anim.animate_trainval_boosting(mylib6,frames)
The careful reader will notice how similar the boosting procedure described above is to the one introduced in Section 9.5 in the context of feature selection. Indeed, fundamentally the two approaches are almost entirely the same, except here we do not select from a set of given input features but create them ourselves based on a universal approximator. Additionally, instead of our main concern with boosting being the human interpretability of a machine learning model, as it was in Section 9.5, here we use boosting as a tool for cross-validation. This means that unless we specifically prohibit it from occurring we can indeed select the same feature multiple times in the boosting process - that is, when applying boosting to the problem of cross-validation employing nonlinear features.
These two use-cases for boosting - feature selection and cross-validation - can occur together, albeit typically in the context of linear modeling as detailed in Section 9.5. Often in such instances cross-validation is used with a linear model as a way of automatically selecting an appropriate number of features / rounds of boosting, with human interpretation of the resulting selected features still in mind. On the other hand, rarely is feature selection - in the sense described in Section 9.5 - done when employing a nonlinear model based on features from a universal approximator, due to the great difficulty in the human interpretability of nonlinear features. The rare exception to this rule is when using tree-based units which, due to their simple structure, can - as discussed in Section 13.5 - be more easily interpreted by humans.
Here we describe a common interpretation of boosting in the context of regression, that of sequentially fitting to the 'residual' of a regression dataset. To see what this means let us study a regression cost function - here Least Squares - where we have inserted a boosted model at the $M^{th}$ step of its development
\begin{equation} g\left(\Theta_M^{\,}\right) = \frac{1}{P}\sum_{p=1}^{P}\left(\text{model}_M^{\,}\left(\mathbf{x}_p,\Theta_M^{\,}\right) - y_p\right)^2. \end{equation}Suppose this boosted model has been constructed by recursively adding a single unit at each step of the boosting process. Since our boosted model is recursive, we can write it equivalently as $\text{model}_M^{\,}\left(\mathbf{x}_p^{\,},\Theta_M^{\,}\right) = \text{model}_{M-1}^{\,}\left(\mathbf{x}_p^{\,},\Theta_{M-1}^{\,}\right) + f_M^{\,}\left(\mathbf{x}_p\right)w_M^{\,}$ where all of the parameters of the $\left(M-1\right)^{th}$ model, $\text{model}_{M-1}$, are already tuned. Examining just the $p^{th}$ summand of the cost function above, notice we can re-write it as
\begin{equation} \left(\text{model}_{M-1}^{\,}\left(\mathbf{x}_p^{\,},\Theta_{M-1}^{\,}\right) + f_M^{\,}\left(\mathbf{x}_p\right)w_M^{\,} - y_p^{\,} \right)^2 = \left(f_M^{\,}\left(\mathbf{x}_p\right)w_M^{\,} - \left(y_p^{\,} - \text{model}_{M-1}^{\,}\left(\mathbf{x}_p^{\,},\Theta_{M-1}^{\,}\right)\right)\right)^2. \end{equation}
On the right hand side we have just re-arranged terms, keeping our term with parameters that still need tuning $f_M\left(\mathbf{x}\right)w_M $ on the left and lumping all of the fixed quantities together - i.e., $y_p - \text{model}_{M-1}$ - on the right. Applying this to each summand of the cost function we can write it equivalently as
\begin{equation} g\left(\Theta_M^{\,}\right) = \frac{1}{P}\sum_{p=1}^{P}\left(f_M^{\,}\left(\mathbf{x}_p^{\,}\right)w_M^{\,} - \left(y_p^{\,} - \text{model}_{M-1}^{\,}\left(\mathbf{x}_p^{\,}\right)\right)\right)^2. \end{equation}By minimizing this error notice we look to tune the parameters of a single additional unit so that
\begin{equation} f_M^{\,}\left(\mathbf{x}_p\right)w_M^{\,}\approx y_p^{\,} - \text{model}_{M-1}^{\,}\left(\mathbf{x}_p^{\,}\right) \,\,\,\,\, p=1,...,P \end{equation}or in other words, so that this fully tuned unit approximates our original output $y_p$ minus the contribution of the previous model $\text{model}_{M-1}^{\,}\left(\mathbf{x}_p^{\,},\Theta_{M-1}^{\,}\right)$. This quantity - the difference between our original output and the contribution of the $\left(M-1\right)^{th}$ model - is often called the residual. It is the 'leftovers': what is left to represent after subtracting off what was learned by the $\left(M-1\right)^{th}$ model.
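This residual-fitting picture can be sketched directly, again with illustrative one-sided stump units and toy data; each round a unit is fit to the current residual, and the residual's mean squared size never increases:

```python
import numpy as np

# Sequentially fitting to the residual: each round a new unit is fit to what
# the current model has not yet explained, shrinking the residual.
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(3.0 * x)
splits = (x[:-1] + x[1:]) / 2                 # candidate stump thresholds

model = np.full_like(y, y.mean())             # model_0: bias only
mses = [np.mean((y - model) ** 2)]
for _ in range(10):
    residual = y - model                      # the 'leftovers' so far
    best = None
    for t in splits:
        f = (x > t).astype(float)
        w = f @ residual / (f @ f)            # closed-form weight vs residual
        err = np.mean((w * f - residual) ** 2)
        if best is None or err < best[0]:
            best = (err, t, w)
    err, t, w = best
    model = model + w * (x > t).astype(float) # absorb the chosen unit
    mses.append(np.mean((y - model) ** 2))
```

Note that the subproblem solved here against the residual is, by the rearrangement above, identical to the subproblem solved against the original outputs with the prior model frozen.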
Below we animate the process of boosting $M = 5000$ single layer units to a one dimensional input regression dataset. In the left panel we show the dataset along with the fit provided by $\text{model}_m$ at the $m^{th}$ step of boosting. In the right panel we plot the residual at the same step, as well as the fit provided by the corresponding $m^{th}$ unit $f_m$. As you pull the slider from left to right the run of boosting progresses, with the fit on the original data improving while (simultaneously) the residual shrinks.
## This code cell will not be shown in the HTML version of this notebook
# load in dataset
csvname = datapath + 'universal_regression_samples_0.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = copy.deepcopy(data[:-1,:])
y = copy.deepcopy(data[-1:,:] )
# boosting procedure
num_units = 5000
runs2 = []
for j in range(num_units):
# import the v1 library
mylib2 = nonlib.library_v1.superlearn_setup.Setup(x,y)
# choose features
mylib2.choose_features(name = 'multilayer_perceptron',layer_sizes = [1,1,1],activation = 'relu',scale = 0.5)
# choose normalizer
mylib2.choose_normalizer(name = 'standard')
# choose cost
mylib2.choose_cost(name = 'least_squares')
# fit via optimization
mylib2.fit(max_its = 10,alpha_choice = 10**(-1))
# add model to list
runs2.append(copy.deepcopy(mylib2))
# cut off output given model
normalizer = mylib2.normalizer
ind = np.argmin(mylib2.cost_histories[0])
weights = mylib2.weight_histories[0][ind]
y_pred = mylib2.model(normalizer(x),weights)
y -= y_pred
# animate the business
frames = 50
demo2 = nonlib.boosting_regression_animators.Visualizer(csvname)
demo2.animate_boosting(runs2,frames)
© This material is not to be distributed, copied, or reused without written permission from the authors.